Top-k Correlation Computation
Authors
Abstract
Recently, there has been considerable interest in efficiently computing strongly correlated pairs in large databases. Most previous studies require the user to specify a minimum correlation threshold before the computation can run. In practice, however, an appropriate threshold can be hard to choose, since different data sets typically have different characteristics. In this paper, we therefore propose an alternative task: finding the top-k strongly correlated pairs. We identify a 2-D monotone property of an upper bound of the φ correlation coefficient and develop an efficient algorithm, called TOP-COP, that exploits this property to prune many pairs without computing their correlation coefficients. Our experimental results show that TOP-COP can be an order of magnitude faster than alternative approaches for mining the top-k strongly correlated pairs. Finally, we show that the performance of TOP-COP is closely tied to the degree of data dispersion: the higher the degree of data dispersion, the larger the computational savings TOP-COP achieves.
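The abstract does not spell out the upper bound or the TOP-COP procedure, so the following is only a minimal sketch of the general idea of threshold-free top-k search with upper-bound pruning. It assumes binary market-basket data and uses the commonly cited support-based bound for φ (with sup(A) ≥ sup(B), φ ≤ sqrt(sup(B)/sup(A)) · sqrt((1 − sup(A))/(1 − sup(B)))). The function names `phi`, `phi_upper`, and `top_k_pairs` are hypothetical, and the plain pair-by-pair scan below is not the 2-D monotone traversal that TOP-COP uses.

```python
import heapq
from itertools import combinations
from math import sqrt


def phi(sup_a, sup_b, sup_ab):
    """Phi correlation of two binary items from their supports (fractions in (0, 1))."""
    denom = sqrt(sup_a * sup_b * (1 - sup_a) * (1 - sup_b))
    return (sup_ab - sup_a * sup_b) / denom


def phi_upper(sup_a, sup_b):
    """Support-only upper bound on phi (assumed bound; reached when sup_ab = min support)."""
    hi, lo = max(sup_a, sup_b), min(sup_a, sup_b)
    return sqrt(lo / hi) * sqrt((1 - hi) / (1 - lo))


def top_k_pairs(transactions, k):
    """Return (top-k pairs as (phi, a, b), number of exact phi computations).

    Illustrative scan only: a pair is skipped when its upper bound cannot
    beat the current k-th best value, so no threshold is needed up front.
    """
    n = len(transactions)
    all_items = sorted({item for t in transactions for item in t})
    tids = {i: {idx for idx, t in enumerate(transactions) if i in t} for i in all_items}
    sup = {i: len(tids[i]) / n for i in all_items}
    # phi is undefined for items present in every transaction, so drop them.
    items = [i for i in all_items if 0 < sup[i] < 1]

    heap = []  # min-heap; heap[0] holds the current k-th best pair
    computed = 0
    for a, b in combinations(items, 2):
        if len(heap) == k and phi_upper(sup[a], sup[b]) <= heap[0][0]:
            continue  # pruned without computing the exact coefficient
        computed += 1
        value = phi(sup[a], sup[b], len(tids[a] & tids[b]) / n)
        if len(heap) < k:
            heapq.heappush(heap, (value, a, b))
        elif value > heap[0][0]:
            heapq.heapreplace(heap, (value, a, b))
    return sorted(heap, reverse=True), computed
```

Calling `top_k_pairs(transactions, k)` on a list of item sets returns the k best pairs together with a count of how many coefficients were actually evaluated, which gives a rough sense of how much work the bound saves on a given data set.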
Related resources
Top-K Correlation Sub-graph Search in Graph Databases
Recently, due to its wide applications, (similar) subgraph search has attracted a lot of attention from the database and data mining communities, such as [13, 18, 19, 5]. In [8], Ke et al. first proposed the correlation sub-graph search problem (CGSearch for short) to capture the underlying dependency between subgraphs in a graph database, along with the CGS algorithm. However, the CGS algorithm requires the speci...
Efficiently Processing of Top-k Typicality Query for Structured Data
This work presents a novel ranking scheme for structured data. We show how to apply the notion of typicality analysis from cognitive science and how to use this notion to formulate the problem of ranking data with categorical attributes. First, we formalize the typicality query model for relational databases. We adopt Pearson correlation coefficient to quantify the extent of the typicality of a...
Aggregation-Aware Top-k Computation for Full-Text Search
A typical scenario in information retrieval and web search is to index a given type of items (e.g., web pages, images) and provide search functionality for them. In such a scenario, the basic units of indexing and retrieval are the same. Extensive study has been done for efficient top-k computation in such settings. This paper studies top-k processing for many emerging scenarios: efficiently re...
RWTH Aachen University, I 5 Max-Planck-Institut für Informatik, AG 5 Holistic Top-k
Querying large data sets is a challenging task in today’s information systems. Users are typically interested in the k most relevant results, namely the first page (e.g., the Google search engine) of the given result set. That is, given a dataset D and a user-defined similarity function f, we are interested in calculating the top-k, i.e., the k highest ranked results (answers). Finding the top-...
Discovery of Top-k Dense Subgraphs in Dynamic Graph Collections
Dense subgraph discovery is a key issue in graph mining, due to its importance in several applications, such as correlation analysis, community discovery in the Web, gene co-expression and protein-protein interactions in bioinformatics. In this work, we study the discovery of the top-k dense subgraphs in a set of graphs. After the investigation of the problem in its static case, we extend the m...
Journal title:
- INFORMS Journal on Computing
Volume 20, Issue
Pages -
Publication date: 2008